Exercise: Train a simple linear regression model

In this exercise, we'll train a simple linear regression model to predict body temperature based on dogs' ages and interpret the result.

Loading data

Let's begin by having a look at our data.

In [1]:
import pandas
!pip install statsmodels
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/graphing.py
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/doggy-illness.csv

# Convert it into a table using pandas
dataset = pandas.read_csv("doggy-illness.csv", delimiter="\t")

# Print the data
print(dataset)

    male  attended_training  age  body_fat_percentage  core_temperature  \
0      0                  1  6.9                   38         38.423169   
1      0                  1  5.4                   32         39.015998   
2      1                  1  5.4                   12         39.148341   
3      1                  0  4.8                   23         39.060049   
4      1                  0  4.8                   15         38.655439   
..   ...                ...  ...                  ...               ...   
93     0                  0  4.5                   38         37.939942   
94     1                  0  1.8                   11         38.790426   
95     0                  0  6.6                   20         39.489962   
96     0                  0  6.9                   32         38.575742   
97     1                  1  6.0                   21         39.766447   

    ate_at_tonys_steakhouse  needed_intensive_care  \
0                         0                      0   
1                         0                      0   
2                         0                      0   
3                         0                      0   
4                         0                      0   
..                      ...                    ...   
93                        0                      0   
94                        1                      1   
95                        0                      0   
96                        1                      1   
97                        1                      1   

    protein_content_of_last_meal  
0                           7.66  
1                          13.36  
2                          12.90  
3                          13.45  
4                          10.53  
..                           ...  
93                          7.35  
94                         12.18  
95                         15.84  
96                          9.79  
97                         21.30  

[98 rows x 8 columns]

We have a variety of information, including what the dogs did the night before, their age, whether they're overweight, and their clinical signs. In this exercise, our labels (the y values) come from the core_temperature column, and our single feature is age, in years.

Data visualization

Let's have a look at how the features and labels are distributed.

In [2]:
import graphing

graphing.histogram(dataset, label_x='age', nbins=10, title="Feature", show=True)
graphing.histogram(dataset, label_x='core_temperature', nbins=10, title="Label")

Looking at our feature (age), we can see that all dogs were 9 years old or younger, and that ages are roughly evenly distributed. In other words, no particular age is substantially more common than any other.

Looking at our label (core_temperature), most dogs seem to have a slightly elevated core temperature (we would normally expect ~37.5 degrees Celsius), which indicates they're unwell. A small number of dogs have a temperature above 40 degrees, which indicates they're quite unwell.

Because the shapes of these two distributions differ, we can already guess that the feature won't predict the label extremely well. For example, if old age perfectly predicted who would have a high temperature, then the number of old dogs would exactly match the number of dogs with a high temperature, and the two histograms would have the same shape.

The model might still end up being useful, though, so let's continue.

The next step is to eyeball the relationship. Let's plot the relationship between the labels and features.

In [3]:
graphing.scatter_2D(dataset, label_x="age", label_y="core_temperature", title='core temperature as a function of age')

It does seem that older dogs tended to have higher temperatures than younger dogs. The relationship is quite "noisy," though; many dogs of the same age have quite different temperatures.
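
One way to put a number on "positive but noisy" is the Pearson correlation coefficient. The sketch below is illustrative only: it uses just the ten rows visible in the printed table above (rows 0-4 and 93-97), so its result should not be read as the correlation for the full dataset, and `pearson_r` is a helper name introduced here rather than part of the exercise code.

```python
import math

# Only the ten rows visible in the printed table (rows 0-4 and 93-97),
# not the full 98-row dataset.
ages = [6.9, 5.4, 5.4, 4.8, 4.8, 4.5, 1.8, 6.6, 6.9, 6.0]
temps = [38.423169, 39.015998, 39.148341, 39.060049, 38.655439,
         37.939942, 38.790426, 39.489962, 38.575742, 39.766447]


def pearson_r(xs, ys):
    """Pearson correlation: +1 is a perfect rising line, 0 is no linear trend."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(ages, temps)
print(f"Pearson r on the ten visible rows: {r:.2f}")
```

On the full table, `dataset["age"].corr(dataset["core_temperature"])` should give the equivalent figure, since pandas computes the Pearson correlation by default.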

Simple linear regression

Let's formally examine the relationship between our labels and features by fitting a line (simple linear-regression model) to the dataset.

In [4]:
import statsmodels.formula.api as smf
import graphing # custom graphing code. See our GitHub repo for details

# First, we define our formula using a special syntax
# This says that core temperature is explained by age
formula = "core_temperature ~ age"

# Perform linear regression. This method takes care of
# the entire fitting procedure for us.
model = smf.ols(formula = formula, data = dataset).fit()

# Show a graph of the result
graphing.scatter_2D(
    dataset,
    label_x="age",
    label_y="core_temperature",
    # params is indexed by name: "Intercept" and "age" (the slope)
    trendline=lambda x: model.params["age"] * x + model.params["Intercept"],
)

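The line follows the upward trend in the data, supporting our hypothesis that there's a positive correlation between a dog's age and its core temperature. The points are still widely scattered around the line, though, so the relationship remains noisy.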
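
For a single feature, the least-squares line has a closed-form solution, which gives some intuition for what the fitting step is doing. The following is a minimal sketch of that textbook formula in plain Python, not the actual statsmodels implementation; `fit_line`, `r_squared`, and the toy points are names and data introduced here for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Fraction of the variance in ys explained by the line (1.0 is a perfect fit)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Toy points that lie exactly on y = 1 + 2x, so the fit is perfect
toy_x = [0.0, 1.0, 2.0, 3.0]
toy_y = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(toy_x, toy_y)
print(intercept, slope, r_squared(toy_x, toy_y, intercept, slope))
# prints: 1.0 2.0 1.0
```

Passing `list(dataset["age"])` and `list(dataset["core_temperature"])` to `fit_line` should closely reproduce the fitted parameters, and the fitted statsmodels result exposes the same goodness-of-fit measure as `model.rsquared`.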

Interpreting our model

Visually, simple linear regression is easy to understand. Let's recap what the parameters mean, though.

In [5]:
print("Intercept:", model.params["Intercept"], "Slope:", model.params["age"])
Intercept: 38.087867548892106 Slope: 0.15333957754731825

Remember that a simple linear regression model is fully described by two parameters: the line's intercept and its slope.

Here, our intercept is about 38.09 degrees Celsius. This means that when age is 0, the model will predict about 38.09 degrees.

Our slope is about 0.15 degrees Celsius per year, meaning that for every additional year of age, the model will predict a temperature 0.15 degrees higher.
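
Put together, the two parameters define the line the model uses for every prediction. Written as an equation (the symbols $\hat{T}$, $b_0$, and $b_1$ are notation introduced here, and the values are rounded from the output above):

$$
\hat{T} = b_0 + b_1 \cdot \text{age} \approx 38.09 + 0.15 \cdot \text{age}
$$

For example, for a 2-year-old dog this gives roughly $38.09 + 0.15 \cdot 2 \approx 38.39$ degrees Celsius.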

In the following box, try to change the age to a few different values to see different predictions, and compare these with the line in the preceding graph.

In [6]:
def estimate_temperature(age):
    # model.params["Intercept"] is the intercept and model.params["age"] is the slope
    return age * model.params["age"] + model.params["Intercept"]

print("Estimate temperature from age")
print(estimate_temperature(age=0))
Estimate temperature from age
38.087867548892106
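
To compare several ages at once, the same calculation can be run in a loop. The sketch below hardcodes the intercept and slope printed earlier so that it runs without the fitted model; `estimate_temperature_from_constants` and the chosen ages are illustrative assumptions, not part of the original exercise.

```python
# Parameter values copied from the printed output of the fitted model
INTERCEPT = 38.087867548892106
SLOPE = 0.15333957754731825


def estimate_temperature_from_constants(age):
    """Predict core temperature (degrees Celsius) from age in years."""
    return INTERCEPT + SLOPE * age


for age in [0, 3, 6, 9]:
    prediction = estimate_temperature_from_constants(age)
    print(f"age {age}: {prediction:.2f} degrees Celsius")
# prints:
# age 0: 38.09 degrees Celsius
# age 3: 38.55 degrees Celsius
# age 6: 39.01 degrees Celsius
# age 9: 39.47 degrees Celsius
```

The ages stop at 9 because the histogram showed no older dogs in the data; predictions beyond that range would be extrapolation and should be treated with caution.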

Summary

We covered the following concepts in this exercise:

  • Quickly visualizing a dataset
  • Qualitatively assessing a linear relationship
  • Building a simple linear-regression model
  • Understanding parameters of a simple linear-regression model